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Abstract 

This review article considers some of the most common methods used in astronomy for regressing one 
quantity against another in order to estimate the model parameters or to predict an observationally ex- 
pensive quantity using trends between object values. These methods have to tackle some of the awkward 
features prevalent in astronomical data, namely heteroscedastic (point-dependent) errors, intrinsic scat- 
(-h \ ter, non-ignorable data collection and selection effects, data structure and non-uniform population (often 

O ^ called Malmquist bias), non-Gaussian data, outliers and mixtures of regressions. We outline how least 

square fits, weighted least squares methods, Maximum Likelihood, survival analysis, and Bayesian meth- 
ods have been applied in the astrophysics literature when one or more of these features is present. In 
particular we concentrate on errors-in-variables regression and we advocate Bayesian techniques. 
Keywords: Astrophysics; Bayesian statistics; Errors-in-variables; Measurement errors; Regression; 
Scaling relations. 

1 Introduction 

In astrophysics, the term "scaling relation" is taken to mean a relationship between different physical values 
of astronomical objects such as, stars, galaxies, galaxy clusters and black holes. In more statistical terms, 
the term indicates a relationship between two variables, Equation (Q]), along with an associated measure 
of variability associated with that relationship. Some well-known examples are the Tully-Fisher relation 
(between galaxy luminosity and circular velocity), the Faber-Iackson relation (between galaxy luminosity 
and velocity dispersion), the fundamental plane (between galaxy brightness and velocity dispersion and 
size), the Magorrian relation (between the mass of a black hole and the bulge mass). These relationships 
are of considerable interest because from their existence and from their parameters we can elaborate or 
test theories about the object formation and evolution. At a high level, we can make inference from these 
relationships about the Universe's birth and fate, or the mysterious dark energy, etc. For example, the 
Cepheid period-luminosity scaling informs us about the physics regulating these objects, and, at the same 
time, in the 1930's demonstrated that the Universe is larger than our own Galaxy, that nebulae are outside 
our own Galaxy and that the Universe is formed by galaxy-islands (HI)- 

Unfortunately it is rarely possible to measure values completely without error. Errors may be deter- 
ministic (e.g. rounding to the nearest integer), stochastic or both. In this article, we concentrate on the 
problem of stochastic noise and we review the interesting statistical questions which arise and the associ- 
ated methodology currently in the literature. We begin with a simple example to illustrate some of the points 
which may occur. Suppose we are given N pairs of observations, x° bs and corresponding y° bs for data points 
i = 1, N. The x° bs are observations on the true values Xj, while the yf 8 are observations on the true 
values yi. It would be common in this field to plot y ohs against x obs to try to learn something either about the 
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objects under study, or, via them, about a bigger objective, such as the geometry of our Universe. Expressed 
more formally, estimating the relationship between the the true values X{ and yi is of interest 

Vi = f{x h 9) (l) 

where 9 represents any parameters of the relationship. In the simplest case, the function / could be a linear 
relationship, with 9 representing the slope a and intercept b: 

Ui = axi + b. (2) 

If the {xi} are actually observed without error (in other words, x° bs = x i) and the {y.;} are each corrupted 
by an additive Normal, or Gaussian, error with constant variance, y° bs = y% + ei where ~ N(0, a 2 ), for 
i = 1, . . . , N, then this problem is the familiar linear regression (the notation ~ reads "is distributed as" 
throughout). Sadly for astronomers, but perhaps happily for statisticians, things are rarely this simple. 

In this article, we will begin in Section [2] by discussing some of the reasons why astronomers are inter- 
ested in regression. Section[3]describes some of the complicating features that are common in astronomical 
data. In light of these features, we review the current state of the literature, mostly astronomical, in Section|4] 
Section [5] will compare the performances of some of these regression methods on simple data. This review 
only addresses scaling relations in astronomy between physical values, given the breadth of the applications 
and the volume of statistical methodology involved we do not attempt to tackle spatial or temporal models. 



2 Why astronomers regress 

There are a number of different reasons why astronomers may have an interest in regressing one quantity 
against another. These include 

• Parameter estimation. It may be that the parameters of the relationship between x and y are them- 
selves the primary interest, as in the slope of the cluster of galaxies Lx — T (X-ray luminosity vs 
Temperature) scaling relation. Usually, we are interested in both the parameter estimates and their 
estimated uncertainty. When the relationship is important because of what it implies for a higher- 
level question (for example, Universe geometry), then it is particularly important to carry forward 
the parameter uncertainty into any higher-level inference. We emphasise that, most of times, we are 
interested in the relation between the true values x and y, rather than between the observed values 

x obs and y obs_ 

• Prediction. It may be the case that there is a considerable difference in either the difficulty or the cost 
of measuring x obs and y obs . Provided there is a trend between the two values and not too much scatter, 
it might then be desirable to predict the expensive values of y (say) using the cheaper measurements 
of x, x obs . Two examples from astronomy are the use of colours to predict redshift (see ||2l for 
example), and the use of one of the many cheaper values to predict mass (for example richness ll3l). 
The emphasis here is on assessing the uncertainty associated with the predicted values of y. 

• Model selection. It may be that the form of the relationship between x and y is of primary interest, 
that is determining the particular f(x,9) in Equation (fl}, with all this implies for the astrophysics. 
The important topic of more formal model choice is outside the scope of this review, see instead for 
example H and 121. 

To illustrate how the goals of an analysis can alter what may be considered a good model, Figured] shows a 
set of 500 points drawn from a bivariate Gaussian where marginally both x and y are standard Gaussians with 
mean and variance 1 and x and y have correlation 1/2. What is the "best" line explaining the relationship 
between x and y here? If we are trying to summarise the relationship itself, then the blue y = x line shown 
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Figure 1: Left panel: 500 points drawn from a bivariate Gaussian overlaid by the line showing the expected 
value of y given x; the yellow band covers those y for which x is close to 2. Central panel: Distribution of 
the y values for x values in a narrow band of x centred on 2, as shaded in the left panel. Right panel: as the 
left panel, but adding the lines giving the expected x values at a given y, and the x = y line. 

on the right hand panel seems reasonable. But is this also good if instead we were trying to predict a new 
value of y given a value of x? Superimposed in red on the left hand panel of Figure Q] is the line giving 
the theoretical conditional expectation of y given x (known theoretically for this bivariate Gaussian to be 
y = x/2). Why might this be a better line for the purposes of prediction? Consider a small range of x values 
close to 2 indicated in yellow. The middle panel indicates the corresponding y values of the points captured 
in this way (overlaid on the theoretical distribution of y\x = 2, where the symbol | will be used throughout 
to indicate "conditional on"). The average of the y values falling in the shaded area is closer to the value 
predicted by the red line (a value of 1 in this case) than to the value predicted by the blue, y = x, line (2 in 
this case). Although this line then is good for prediction, it perhaps seems too shallow with respect to the 
overall trend identified by the points, which is better captured by the x = y blue line. 10 provides a nice 
discussion of these issues, and we offer a further example in Section [5] 

Figure [U taken from 0> illustrates one further point, namely the asymmetry of regression. If, rather 
than predicting y from x, one wishes to do the reverse and predict x given y, then the line defined by the 
conditional expectation of x given y may be useful. Shown in red in the third panel, this line is different again 
from the previous two. In this example, the symmetry of the marginal distributions of x and y is reflected in 
the symmetry of the two red lines. However there is an underlying conceptual difference between the two 
predictions given that we regard one variable as a predictor and the other as a response. In a slightly more 
complex setting than this, Q emphasises that while this is known to many astronomers, not all appreciate 
the consequences of the difference between the direct and inverse fit in the context of the Magorrian relation 
(where x = L is the galaxy luminosity and y = M. is the black hole mass). Astronomers may often find 
themselves in situations where a symmetric treatment of variables would be more natural than a "predictor- 
response" approach. 

3 Common features of astronomical data 

3.1 Heteroscedastic error structure 

Much standard statistical theory deals with the situation where all observations are made with the same ac- 
curacy in the sense that the sizes (variances) of the errors are assumed equal, say c& for all the observations, 
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Figure 2: (a) Mass (M) vs Richness (n) (taken from [3]); these data represent galaxy clusters, each point 
corresponding to a different cluster. Mass is one of holy grails in astronomy but it is observationally expen- 
sive or even impossible, to measure, whilst at the same time being needed in many studies because most 
properties depend on mass. The easily observed quantity richness (n) could in this respect be useful to 
predict mass, (b) Wavelength shift (1000 * (AA/A — 3.0248)), vs sensitivity (k) to physical constant (data 
taken from JHJ); the points correspond to different measurements on a single object, the quasar 3C295. One 
of the pressing questions in astronomy and physics is whether physical constants are indeed constant or take 
different values at different locations in the Universe or at different times. The slope of the regression, if 
different from zero, implies that (a specific combination of) physical constants are not taking a unique value 
everywhere in the Universe, being different between Earth laboratories and the very distant Universe probed 
by the plotted experimental points. In both panels the solid line marks the fitted regression, taking into ac- 
count the intrinsic scatter as well as the measurement noise, while the dashed lines show this plus or minus 
the estimated intrinsic scatter a sca t derived from the Bayesian analysis in 0. Error bars on the data points 
represent the scale (size) of measurement errors, indicating one standard deviation. Distances between the 
data and the regression are due to both measurement error and intrinsic scatter. (The shaded regions mark 
68% highest posterior credible intervals for the regression.) 



e-g 

yf s = Vi + €i where e< ~ JV(0, oj), i = 1, . . . , JV. (3) 

Such a simplifying assumption is unlikely to be valid in many astronomical studies where a heteroscedastic 
error structure may have to be adopted, e.g replacing Equation ([3]) by 

y? s = y l + e l where e { ~ N(0, a 2 ^), i = l,...,N. (4) 

The notation a 2 i is adopted here for later compatibility with situations where the related variable x° bs also 
exhibits heteroscedastic error. There are a number of reasons why Equation © may be more appropriate, 
sometimes because the N objects may not be observed under identical conditions, for example, under vari- 
able atmospheric conditions, sometimes because of object related differences, for example other values of 
the object such as luminosity, colour, size, etc. As an additional complication, it is possible for the error 
structure on the i th object, x% and yu to be correlated; Q discusses the example of the Tully-Fisher relation 
where an inclinational collection applied to a galaxy affects both its luminosity and velocity errors. 

3.2 Intrinsic scatter 

Equations {T]) and (0 assume that the relationship between the iV objects' values {x{\ and {yi} is determin- 
istic, albeit unknown. In many cases, this clean relationship is not representative and there is also an intrinsic 
scatter (a stochastic variability) to consider. This extra layer of variability is distinct from the (possibly het- 
eroscedastic) measurement noise, indicating instead that the objects under study are examples drawn from a 
population with a spread of values. Extending the model in Equation (O to include intrinsic scatter, the full 
set of equations becomes 

xf s = Xi + e X)i where e x>i ~ N(Q, a 2 xi ) 

yf S = Vi + £y,i wher e e y ,% ~ N(0, tr^) 
y t ~ N(axi + b,a 2 scat ). (5) 

At this point a formal distinction between the two types of measurement error found in the literature should 
be noted since we have made a choice by specifying Equation ©: the situation above where a distributional 
assumption is made about the observed data given the true values is called the "classical measurement error" 
model; if the reverse is true and the distributional assumptions are made about the true values given the 
observed ones, then the model is known as "Berkson error" ( iflQl . Chapter 1). We will work throughout with 
the classical model, because the latter is usually available in physics and astronomy and called calibration 
when dealing with instruments: one injects x in the apparatus and measure the distribution of observed data, 
p(x obs \x). 

Two examples where intrinsic scatter is relevant are shown in Figure |2] The left panel of Figure |2] 
shows a mass vs richness scaling where the variability between objects is greater than that simply due to 
measurement errors and is more likely due to differences of the physics on the objects under study (||3"1). 
In other cases, the scatter may be due to unaccounted systematic effects. The right hand plot reports the 
ratio between the wavelength of several spectral lines compared to the wavelength measured in laboratory, 
(^obs ~ <^iab)/^iab = AA/A against the parameter k, measuring the sensitivity of the considered spectral 
lines to the physical constant k (HI). The intrinsic spread visible is due to a source of error still not identified, 
because all these measurements pertain to a single object and so there is no "population" variability effect. 

3.3 Non-ignorable data collection and selection effects 

In an ideal world, the sample we have is randomly selected from the population we wish to study, and so 
any inference we draw about the sample applies to the population; in other words, we can ignore the way in 
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Figure 3: A simulation (from ifTTTO built around a true Lx — T data set and selection function (from |[T2l ). 
illustrating how the observed data may be unrepresentative compared to the underlying scaling relation due 
to selection effects. The solid blue line shows the true relationship but due to the selection effects, most 
points lie above the line. 

which the data were collected. Unfortunately, data collection is very often non-ignorable. As an example, in 
(3 studying black hole masses, galaxies at high redshift are selected by their nuclear activity while galaxies 
at low redshift are selected by luminosity or velocity dispersion. But as that paper points out, these two 
subpopulations are not homogeneous because of the spread in luminosity at any given black hole mass and 
therefore the effect of the selection criteria may be different for the two. In this particular case, ignoring 
this selection effect difference leads to biased estimation. At this point, it may be worth giving a formal 
definition of bias. Statisticians define the "bias" of estimators to be the difference between the expected 
value of the estimators and the true value of the parameters; however the term "biased" is also widely used 
to refer to the non-representative nature of a sample (effectively the source of the statistical bias). 

Selection effects can arise in a number of forms: a) Object values may be partially missing in a non- 
random way although the object is in the catalogue. This means we can say something about the missing 
object, e.g. its position in the sky, but we are lacking a measurement of the quantity of interest; b) The 
object itself is missing from the catalogue (not at random), and often we can say little or nothing about what 
is missing, including any idea of how many objects are missing; c) In an extension of the above scenarios, 
objects can be missing with a probability different from zero or one, for example objects above/below some 
threshold enter/exit the sample with some probability p, where p is usually a smooth function of the object's 
parameters. This case is illustrated in in Figure [3l taken from ifTTI . At all temperatures T, clusters brighter 
than average are easier to detect (simply because they are brighter) and therefore they are over-represented 
in most samples, while those fainter than the average are underrepresented. Therefore, at a given T the mean 
Lx of clusters in the sample is higher than the average Lx of the population at the same T. The selection is 
stochastic: it is possible to miss an object above the threshold and yet still to record some objects below the 
threshold. This effect is self-evident and is called the "bias" of the X-ray selected clusters in |[T3l . and the 
"halo model" in lfl4l . but its consequences are not always fully appreciated, as emphasised in [13]. We will 
return to these ideas in Sections l4~6l and l6l 

Since collecting samples with known (and thus hopefully correctable) selection functions is observa- 
tionally hard, the issue of selection effects and modelling data collection is often deferred until a selection 
effect becomes apparent in the data themselves. Often, astronomers are forced to study the property of 
the population entering in the sample (SDSS or Galex galaxies, PG QSO, EROs), instead of the property 
of the population sample, because either the selection function is poorly known or it is so severe that it is 
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Figure 4: Posterior distribution (black curve) of the source flux, having observed four photons and knowing 
that source number counts have an Euclidean distribution, that is their probability distributions follow a 
power law (shown by the blue curve). The highest posterior 95% interval of the source flux is shaded in 
yellow (adapted from |[T8l ). 

impossible to make inference about the wider population. 
3.4 Data structure and non-uniform populations 

Even when the data collection does not contain selection problems, neglecting the population structure can 
induce estimation biases: iTTSTl and lfT6l reminded the astronomical community that even symmetric errors 
(for example, Gaussian likelihoods) have an effect, later called the Eddington or Malmquist bias. 

Figure @] illustrates the problem using an example from the X-Bootes survey ( 11171 ) of X-ray sources 
performed with the Chandra Space Telescope. The observation here s obs is a count measurement on the 
underlying flux, s. As such, we can assume that given the underlying flux s, the data point s obs is distributed 
as a Poisson random variable, s obs ~ V(s). However it is also known from previous accurate observations, 
that the probability distribution of s (known as the number counts by astronomers) is a power law with slope 
—2.5 (in the standard astronomical units this slope is referred to as the Euclidean slope). This distribution 
is illustrated in blue on Figure 0] showing that small values of s are more likely than large ones. Therefore, a 
source with s obs is more likely overall to be a fainter source with an overestimated flux than a stronger one 
which has been underestimated. This effect has nothing to do with the asymmetric Poisson distribution, it 
would still be there if a Gaussian noise structure were assumed. Using Bayesian methodology (as described 
in Section [6]), Figure |4]shows the number counts and the posterior distribution of the object flux for a source 
with four observed photons, s obs = 4, typical of the X-Boote survey. The posterior distribution has its mode 
at s = 1.5 and mean at s = 2.5, in agreement with astronomical experience that the true flux, s, of a source 
is likely to be lower than the observed value, s obs . 

Whether or not a Bayesian approach is preferred, most astronomers would agree that some form of bias 
correction is often needed. Such bias is widespread in astronomy, applying not only to fluxes, but almost 
to all quantities which have an abundance gradient, such as galaxy cluster richnesses (number of galaxies) 
or masses, parallaxes (i.e. the small angular displacement of stars due to the Earth's orbit around the Sun), 
velocity dispersions (how fast galaxies move inside clusters of galaxies or how fast stars move in globular 
clusters). This has an obvious impact on the determination of regression parameters of scaling relations 
using these quantities, but its consequences are not always fully appreciated, as emphasised in 0. 
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Figure 5: The B — V colour vs R magnitude relation of galaxies (from ED), where B — V, short for blue 
minus visual magnitudes, is a colour index used in classification and the R magnitude is —2.5 times the 
base 10 log of the source flux plus a constant. The spread in colour around the mean regression, called in 
astronomy the colour-magnitude relation, is a measure of heterogeneity in the star-formation histories of 
red cluster galaxies; the larger the spread, the more variable their star formation histories are. Top panel: 
diagram for galaxies in a random line of sight, showing a realization of contaminating galaxies. Bottom 
panel: diagram for interesting galaxies belonging to the cluster Abell 1185, plus contaminating galaxies, 
i.e. galaxies in the cluster line of sight but unrelated to the cluster. Which individual point is interesting 
rather than contaminating is initially unknown. As a result interesting galaxies form a regression heavily 
contaminated by outliers (there are 4 contaminants per interesting point). The solid line is the mean colour- 
magnitude relation of interesting galaxies fitted from the lower plot and overlaid on both plots, the narrow 
yellow area marks the 68% highest posterior interval (lower plot only). 

3.5 Non Gaussian data, outliers and other complications 

Astronomical data, like all data, can be affected by a number of other problems which negatively affect a 
regression analysis. For start, it is quite common for the error structure to be other than Gaussian, particularly 
in situations where small numbers of counts are recorded (even for large numbers of counts, a Gaussian 
approximation may be not acceptable, see |[T9l ). For example, in the left-hand panel of Figure |2]the x-axis 
plots the log of a quantity which is (close to being) Poisson distributed. 

Another very common problem is that of outliers and contaminated samples; as soon as one collects 
a sizable sample, it is quite possible to also collect data for objects having little to do with the population 
of interest. When these objects fall far away from the model suggested by the regression, they are called 
outliers and some sort of outlier identification and removal may be needed, although removal may only be 
essential in cases where the outlier is influential (see EUl . Chapters 1 to 5, for more information on assessing 
outliers as well as other important diagnostic aspects of regression modelling). 

There can be situations where the level of contamination by data points not following the hypothesised 
regression model makes it hard to identify the true population under study. Figure [5] gives an example of 
such a situation where the sample of interest is overwhelmed by other points. In this case more powerful 
methods are needed to identify and remove the contaminating sample ([21] and 11221 ). 

The sample could also contain several subpopulations, each following their own regression line (indeed 
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this is one way of viewing large scale contamination such as that in Figure [51 with points either being in the 
group of interest or in the contaminating subpopulation). Such a mixture of regressions greatly complicates 
the fitting process (see |23l for examples of such mixtures of regressions in non-astronomical examples and 
ll22l for an astronomical application). 

4 Commonly used regression techniques in astronomy 

In this section we will summarise the various approaches to regression commonly used in astronomy. Gener- 
ally in cases where parametric models are to be used, it is possible to label approaches as either "frequentist" 
or "Bayesian". The former bases inference on the likelihood of how the data arise, while the later works 
with the posterior distribution of the parameters given the data using some prior information in the model 
building. Consequently, the rationale behind parameter estimation differs between the two, as does the 
subsequent assessment of parameter uncertainty. In a frequentist setting, Maximum Likelihood Estimation 
(MLE) is usually preferred, that is choosing the parameter estimates which maximise the likelihood. In a 
Bayesian setting, parameter estimates are chosen as appropriate summaries of the posterior distribution, for 
example the posterior mean or the posterior mode. Other parameter estimation rationales are also in use, for 
example Method of Moments, Least Squares or Robust estimates; a recent overview is given by ll24l . We 
review how some of this methodology have been used in the astronomy literature, attempting to highlight 
strengths and weaknesses. 

4.1 Ordinary Least Squares regression 

The method of fitting a regression line y = ax + b by ordinary least squares regression (OLS) is extremely 
widespread in many disciplines ((25j). Based on the rationale of minimising the sum of squared residuals 
(i.e. the difference between the observed data and the values fitted to each point by the model), S = 
J2iLi{Vi bs — (b + axi)) 2 , the parameter estimates are easily calculated as 

Cov(x,y obs ) 

a = — — 

Var(:r) 

b = y obs -ax (6) 

where we deliberately express the estimators in terms of the estimated moments of x and y obs for compati- 
bility with later estimators. 

As a rationale for parameter estimation, OLS does not require any underlying distributional assumptions 
to be made. Finding uncertainty estimates associated with these parameter estimates does however require 
some such assumptions (as a minimum the mean and variance of the error structure for y obs ). In the case 
that the {x-i} are observed without any error and yi errors are all Gaussian, of the same size, and there is no 
intrinsic scatter (i.e. that the rather simplistic Equation Q holds) then OLS is equivalent to fitting the model 
using Maximum Likelihood. In this case a great deal of theory exists for how the OLS estimators behave 
statistically. A well developed suite of diagnostic devices exists for spotting when the assumptions are 
violated. Diagnostic plots and diagnostic statistics followed by omission of the offending points can be used 
to tackle outlier problems. However as stressed by one of the anonymous referees "simple strategies like 
outlier detection and elimination based on diagnostic plots tend to work reliably only in simple regression 
with one predictor and a few outliers - but even then problems arise because little can be said about the 
distribution of the final parameter estimators based on such deletion procedures, so that the use of such 
estimators should not be encouraged." 

OLS is also not very robust to the other complications we might reasonably expect to crop up in as- 
tronomical data, Section |3j For example, when there is error on the predictor variable (i.e. x obs is a noisy 
version of x) which is uncorrected with the error on the response variable, then the estimate of the slope 
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of the regression line a is statistically biased towards zero (see, for example, iflOl . Chapter 4). This has an 
obvious negative effect when we are primarily interested in assessing scaling relations. As a result, various 
alternatives have been developed and we describe these in the following sections. 



4.2 Weighted least squares fits 

OLS minimises the sum of the squared residuals of fitting the line ax + b to the data y obs . If the y errors are 
homoscedastic (i.e. all with variance equal to a 2 ), minimising the sum of the squared residuals is equivalent 
to minimising 

When there is intrinsic scatter and the observation noise is actually heteroscedastic, the logical next step in 
many astronomical works is to minimise a weighted sum of residuals 

a ^ (yf s - (b + ax,)) 2 

x =2_j —2 W 

i=l i 

where a 2 = cr 2 cat + a 2 ^ represents the intrinsic scatter plus the heteroscedastic measurement noise in the 
notation of Equation (0) and these values are assumed known. See |[26l for numerical details of minimising 
this penalty. 

Of course, we are also likely to be facing observational errors on the {xi}, so that in fact we observe 
instead the {x ohs }. If the observation noise on both variables and the intrinsic scatter are statistically inde- 
pendent and known, one may introduce a new reweighted penalty 

2 = y (V? s - (b + ax?*)) 2 (9) 

We stress that the standard notation x 2 does not imply that the statistic follows the distribution of the same 
name. This change in weighting arises because following Equation ©, 

T 2 ■) 



yf s = Vi + % where e Vi ~ N(Q, 

= (axi + b + t scat ) + e yi where e scat ~ N(0, a 2 cat ) 

= a(x° bs - e Xi ) + b + e sca t + % where e Xi ~ N(0, a 2 x i ) 

soVar{y° bs ) = a 2 a 2 Xfl + a 2 where as before a 2 = a 2 scat + a 2 y i . (10) 



The optimisation is complicated by the a 2 term on the denominator which makes the problem nonlinear, |[26l 
provides a suitable numerical optimisation algorithm known as FitEXY. Minimisation of this reweighted 
penalty requires that the {of} and {a 2 { } are known. Weighted least squares fits cannot address an unknown 
intrinsic scatter by including the parameter when minimising over all the unknown parameters (if <r 2 cat is 
unknown, then doing this, the minimal x 2 obviously occurs for o 2 cat — > oo). |[27l points out that finding 
associated statistical properties of this estimator can be achieved via bootstrapping. 



4.3 Maximum Likelihood Estimation 

Putting a greater emphasis on parametric modelling than does OLS, Maximum Likelihood Estimates (MLEs) 
are those parameter estimates which maximise the likelihood of the observed data. They are much favoured 
by statisticians not least because of the wealth of associated theory which yields (asymptotic) sampling 
distributions in many cases and thus interval estimates and hypothesis tests. 



10 



In several of the scenarios considered in the previous two sections, MLE and OLS or weighted least 
squares estimates of the regression parameters a and b would coincide. For example, reconsidering the case 
where the {a 2 } are known, there is no observation noise on the {xj}, and the Normality assumptions of 
Equation §5$ hold, then the likelihood for for a and b is 

n x.s tt 1 ( (y° bs - axi - b) 2 \ 

L(a, b) = H - 1 == exp ( - W — ) . (11) 
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It is clear that the maximisers of Equation (fTTT) are also the minimisers of x 2 denned in Equation ® ([28']). 

The situation is more complicated if the intrinsic scatter, cr 2 cat , has to be determined from the data or if 
there is observation noise on the {xj}; this latter errors-in- variables regression has been considered in the 
statistics literature; for example, |[29l . Chapter 3, and ||30l . Chapter 12, have nice discussions of different 
approaches to the problem. In the case where the errors for both x and y are normal and homoscedastic, the 
complete likelihood can be written 

L(a, b, {xi},a x ,a y ) = (2iry/ a*a$) exp( — 2 ) exp( ^ ) (12) 

x y 

where a 2 represents the variance of the x error and a 2 represents the sum of a 2 cat and the common variance 
for the y observations. They point out that by taking each X{ = x° bs and letting a 2 — > 0, the likelihood 
can be made arbitrarily large. As a result, maximum likelihood tends only to be performed after making 
additional assumptions about the variances, most commonly that a 2 = Xa 2 where A is a known positive 
constant (although this assumption is not without its problems, EH)- Alternatively, if we were to assume 
known heteroscedastic variances a 2 i and a 2 i5 the corresponding likelihood would be 

L{a, b, {Xi\, a scat ) = — p=== === exp( J exp' 



nil: ^lM, + ° 2 scat) 2 °h 2 «, + 

(13) 

which could be maximised numerically over the + 3 unknown parameters, although again it is not clear 
whether the known variances assumption is justifiable. Equation (fT3l) has the unappealing property that the 
number of unknowns, A+3, increases as the number of data points, {x° bs , y° hs }f =l , increases. In particular, 
this means that the usual asymptotic distributional results for MLEs do not hold. 

4.4 Robust estimation 

Parameter estimation can be affected by outliers and other violations of parametric assumptions. Robustness 
to outliers and to some aspects of the Normality assumptions can be introduced by a switch from least 
squares to least absolute deviations, replacing the standard sum of squares by (in its simplest case) 

N 

S = J2\yi bs -(b + a Xi )\. (14) 

i=l 

This form is quite commonly used in astronomy, being offered in ||2"6l ; see also |[32l for a witty introduction. 

The fitted value b is here the median of the y° hs — (b + axi) and the estimate of the slope can be found 
numerically as the solution of the equation 

JV 

= sign(yf s - (S + axi)). (15) 

8=1 

(There may not be a unique solution to this problem.) Least absolute deviations regression is quite robust 
against unusual y° hs but it does not cope so well with unusual X{ values (high leverage points) and can be 
quite unstable with regard to small changes in the {xi}. However it is just one possible robust approach and 
this is an active field of research, see for example 11331 . Chapter 7. 
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4.5 Bivariate Correlated Errors and Intrinsic Scatter (BCES) 

In 1996 @ proposed a widely applicable extension of OLS, known as BCES. It considers regressions with 
heteroscedastic and even correlated errors on both axis as well as an intrinsic scatter. Again the assumption is 
made that the measurement error variances and covariances are known, although they may vary from point to 
point, and importantly they may be a function of the response and predictor variables. The BCES estimator 
extends OLS by considering the usual estimates in terms of the moments of the bivariate distribution of x 
and y, Equation ©, and then expressing these moments in terms of the moments of the observed x obs and 
y obs and the known measurement variances. In analogy with Equation (O, the BCES estimates of slope and 
intercept when regressing y on x are 

Var(x*) - i Eili alt 
b = y° bs -ax obs (16) 

where a 2 xy i is the known covariance between e x ^ and e y ^ and i is the known variance of e Xj j (as de- 
fined by Equation ©). These estimates are referred to as the "Moment-based corrected estimators" in ifTOl , 
Chapter 4, where fuller discussion is given regarding their properties. Asymptotically, the slope and inter- 
cept estimators are shown by (9[ to have Gaussian sampling distributions and so confidence intervals can be 
estimated. As an alternative, BCES also assesses the fitting errors using bootstrap ideas (repeatedly resam- 
pling from the observed data to form an empirical sampling distribution of the estimators). See the original 
paper for more complete details. 

Equation (fT6l) provides point estimates for the regression of y on x. @ also derive estimates for when 
the regression of x on y is preferred. As the two resulting lines will not be equal, the BCES bisector is also 
defined as the bisector of the two. 

BCES addresses some of the difficulties arising in scaling relations in astrophysics, namely errors on 
measurements which are potentially correlated as well as being functions of the true x and y. However, 
problems related to outliers, data structure, upper-limits and selection effects are not addressed by this 
methodology, and in small data sets non-Normality of the data may be an issue for exact confidence intervals. 
Code to implement these estimators is available via 

http : / /www .astro. wise. edu/~mab/ar chive/ stats /stats . html. 

4.6 Astronomy Survival Analysis (ASURV) 

In the wider world of statistics, there are many application areas in which data are censored and the field of 
survival analysis has developed to deal with such data (see (34l for example). As the name perhaps suggests, 
survival analysis tends to deal with cases where observations can only be made when the subject survives 
more or less time than a certain observation date. As such, the analogy with flux or other attribute limited 
selection is quite clear. The analogy is not perfect however when we are thinking about upper/lower limits 
or selection function, because these are stochastic (the inclusion/exclusion occurs with some probability), 
while censoring is deterministic. 

j35l and ll36l introduced the astronomical community to these ideas thereby allowing astronomers to 
perform regression in the presence of upper/lower limits (following the paper f37l on a similar theme for 
univariate problems). (35l leads the reader through a series of statistical procedures in the presence of 
censoring, of which we will concentrate here on regression with the assumption of upper-tail censoring on 
both variables and a known number of potential observations. The suggested procedure begins by finding 
a nonparametric estimate of the joint distribution of the two variables. To do this, the data are first binned 
on each variable separately, including a bin for the censored values. Then considering the intersection 
of the two sets of bins, four distinct classes emerge: points for which both measurements are available, 
points for which only one variable is available with the other censored (the two cases), and points for which 
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both values are censored. Based on counts of the numbers of points in each class, l35l derives Maximum 
Likelihood estimates of the probability of each bin combination. This stage does not make any parametric 
assumptions about how the data are distributed. Finally, given the estimated joint distribution, the moments 
of the distribution can be found and used in the usual OLS formulations, Equation ©. See the original paper 
for more complete details. 

j36l reviews the methodology of ll35l . adding two further approaches. In the first, data are assumed only 
to be censored on the response variable, and a distributional assumption of Normality is made for the re- 
sponse variable then using the EM algorithm [38] to impute the censored values for use in a Maximum Like- 
lihood approach. In the second, a nonparametric approach is taken, relaxing the Normality assumption but 
again only assuming censoring on one variable, making it perhaps less widely applicable than the approach 
of IT351 . The paper compares the various approaches via simulation studies and on several astronomical 
applications. A useful resource written by some of the authors of this paper, amongst others, is code to im- 
plement these estimators, available at http : //astrostatistics .psu . edu/ statcodes/asurv 
and described in |[39l . 

The explicit handling of censoring greatly extends the range of regression problems which can be tackled 
appropriately, as does the nonparametric approach of some of the methods. However some limitations 
remain; there is not currently provision for incorporating intrinsic scatter nor heteroscedastic error structure, 
nor the other common features of astronomical data in Section [3] In particular, ASURV only considers a 
specific type of upper limit censoring, which is a sharp upper limit. As already mentioned, the most usual 
case in astronomy is a soft, probabilistic, threshold, one in which also some objects below the threshold may 
be observed while some objects above the threshold will be missed. 

4.7 Errors-in-variable Bayesian regression 

Bayesian regression started to receive significant attention in the astronomical community after the publica- 
tion of l40l . which is a brief summary of the relevant content in BTTl . and the errors-in- variable model of 
ll42l . The important differences for the practitioner between Bayesian and non-Bayesian approaches lie in 
the model specification and interpretation of results. In terms of the former, a Bayesian approach takes the 
usual question "How did the data I observed arise?" and adds the new question "What do I know about the 
parameters even before I collect any data?". Effectively it asks the practitioner to quantify their uncertainty 
about the parameters in the form of a prior distribution (which may be either very precise or, more com- 
monly, very vague). This is combined with the usual likelihood for how the data arise via Bayes' theorem 
to form a posterior distribution which now quantifies our uncertainty about the parameter having observed 
data. In a mathematical sense, the posterior distribution is simply proportional to the product of the likeli- 
hood function and the prior distribution. Summarising the posterior by its location (mean, median, mode, 
etc) quantifies the parameter value; its spread, e.g. the interval that includes 68% of it, quantifies the param- 
eter uncertainty. On the practical side, there are now a number of Bayesian packages which allow the user 
to specify relatively general models without needing to compute the posterior in closed form (i.e. as a single 
mathematical expression whose derivation is often difficult). Amongst them, perhaps the most notably are 
BUGS 63 and JAGS O. 

Despite the philosophical differences, since a Bayesian approach and fitting by Maximum Likelihood 
share a common model for how the data arise, it is often the case that, numerically, the two sets of resulting 
point estimates can be quite close, especially if there is a lot of informative data or if the prior information 
is weak. This is entirely positive. Where a Bayesian analysis may really help is when perhaps we have 
strong prior information or wish to explore the uncertainty associated with relaxing some condition required 
for Maximum Likelihood estimation (for example, the constant ratio of variances assumption). It may also 
permit some dimension reduction if the latent variables {x{\ are not of interest by integrating them out of the 
posterior distribution analytically, as is done by HOI . For example, considering Equation (fT3l . the posterior 
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for just a, b, a 2 cat can be expressed 



p{aA<cat\{xr,yr}) = / P(a,b,{x i },ai cat \{xr,yr})d{x i } 



oc / L(a,b,{xi},a scat )p({xi},a,b) d{xi} (17) 

J{ x i\ 

where p({xi}, a, b) is the joint prior and the integration is over the N dimensional latent variables. Obviously 
this dimension reduction approach will only be feasible in certain analytically tractable cases. When analytic 
dimension reduction is not an option, numerical marginalisation is required, either directly by numerical 
integration or indirectly by Monte Carlo methods. 

To illustrate the Bayesian approach, a version of the errors-in- variable regression model assuming known 
{a 2 j} and {a 2 ^ takes the usual likelihood equations 

ATI*. ~2 



N(xi,a 2 i ) Gaussian errors on x obs 



y° hs ~ N(yi, a 2 J Gaussian errors on y obs 



where y, L ~ N(axi + b, crj cat ) Gaussian intrinsic scatter around a straight line for y (18) 

and adds the equations specifying the additional question "What do I know about the parameters even before 
I collect any data?", for example, we could specify that independently 

Xi ~ U (— 10 4 , 10 4 ) Uniform population structure for Xi, i = 1, . . . , N 



r 2 

scat 



InvGam(0.01, 0.01) Inverse Gamma prior on scatter 



a ~ t\ Student's t on slope, equivalent to a uniform on angle 

6~ N(0, 10 4 ) Gaussian prior on intercept. (19) 

(The choice of an Inverse Gamma prior for the variance term is quite common but is not without criticism, 
B31 . The numerical values in the uniform prior are meant to indicate realistic limits for a particular applica- 
tion rather than being general purpose.) Precise information could be introduced for particular applications. 
For example we may know from previous analyses that the slope is % ± 0.1, and in such a case we could 
use a ~ N(ir, 0.1 2 ); the prior is one way to take advantage of the work of previous generations of re- 
searchers. Code to implement the error-in- variable Bayesian regression is given in the Appendix. Other 
common features of astronomical data can be addressed, as we detail in Section [6l 

Turning to one of the other uses of regression mentioned in Section |2l prediction, before data z are 
collected (or even considered), the distribution of predicted values z can be expressed as 

p(z) = J p{z,6)d6 = J p{z\9)p{0)d0. (20) 

These two equalities result from the application of probability definitions, the first is simply that a marginal 
distribution results from integrating over a joint distribution, the second is Bayes' rule. If some data z 
have been already collected for similar objects, we can use these data to improve our prediction for z. For 
example, if mass and richness in clusters are highly correlated, one may better predict the cluster mass 
knowing its richness than without such information simply because mass shows a lower scatter at a given 
richness than when clusters of all richnesses are considered (except if the relationship has slope exactly equal 
to tan kir/2, with k = 0, 1, 2, 3). In making explicit the presence of such data, z, we rewrite Equation (l20l) 
conditioning on z 

p(z\z) = I p(z\z,9)p(9\z)d6. (21) 



The conditioning on z in the first term in the integral simplifies because z and z are considered conditionally 
independent given 0, so that this term becomes p(z\Q). The left hand side of the equation is called the 
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posterior predictive distribution for a new unobserved z given observed data z and model parameters 9. Its 
width is a measure of the uncertainty of the predicted value z, a narrower distribution indicating a more 
precise prediction. Examples are given in Section [5] 

5 Performance comparisons 

In this section we consider existing and new comparisons of some of the different regression techniques. 
Most comparisons tend to use relatively simple simulated data sets because many of the regression tech- 
niques do not address more than a few of the features of astronomical data listed in Section [3] We split our 
review into performance in estimating the regression and performance in prediction. We have not included 
ASURV in the comparisons because it addresses only censoring of all the possible data complications and 
this is relatively uncommon in astronomy, which deal instead mostly with a soft, probabilistic, threshold. 

5.1 Estimating the regression parameters 

||27l compares performance in recovering the regression parameters for OLS, BCES and FitEXY and a like- 
lihood model he calls the Gaussian structural model proposed in that paper. For OLS and FitEXY poor 
performances might be anticipated because they are applied outside their range of validity since the simu- 
lated data has heterogeneous errors on both x and y (not addressed by OLS) and a unknown intrinsic scatter 
(not addressed by FitEXY). In finding the Maximum Likelihood Estimates based on the Gaussian structural 
model he compares asymptotic theory to derive uncertainty estimates with a semi-Bayesian approach draw- 
ing samples from the associated posterior when using very weak priors. ll27l found that OLS returns slopes 
which are statistically biased towards zero, in agreement with [93, while BCES and FitEXY return intrinsic 
dispersions which are systematically wrong. This is unsurprising for FitEXY and OLS since they do not ad- 
dress the simulated case (we emphasise that FitEXY allows an intrinsic scatter only if known). The Gaussian 
structural model in its semi-Bayesian approach (ignoring the prior when computing the point estimate of the 
parameters, but using it to compute their uncertainties) outperforms the other methods, including BCES. 

|[T8l compares two methods only, BCES and the errors-in- variable Bayesian model 11421 of Section 14771 
In this case, the comparison uses models within the range of validity, generating 1000 samples of 25 data 
points drawn from a linear regression with slope a = 5, an intrinsic scatter a 2 cat = 1, and homogeneous 
Gaussian errors with a 2 = 1 and a 2 = OA 2 . These parameters are chosen to approximate X-ray luminosity 
vs velocity dispersion of clusters of galaxies e.g. Roll . Three examples are shown in Figure [6] [QO finds 
that BCES sometimes estimates the slope parameter very poorly (see the right hand panel in Figure [6]for an 
example and the left hand panel in Figure [7J for an ensemble view). When BCES does estimate the slope 
poorly, it does also return a much larger estimated measure of error, right hand panel in Figure [7J On average 
over the 1000 simulations, the Bayesian approach has a smaller measure of error than BCES. 

Computationally, the costs of the different techniques vary considerably, OLS being the cheapest and the 
Bayesian errors-in- variables the most expensive as it requires Markov chain Monte Carlo methods (MCMC). 
In this later case, depending on the size of the data set, the cost may be up to a few minutes or hours 
depending on the complexity of the model rather than a few seconds. 

5.2 Prediction 

Turning to a comparison of prediction accuracy, we consider a simulation study involving a simpler scenario 
than that in the previous section. We generate 100 values of x% from a non-central scaled Student-t distri- 
bution with 10 degrees of freedom, location 0.1, and scale 1.1. These x,- L are perturbed by homoscedastic 
Gaussian noise with a 2 = 1 to give 100 x° bs . The regression line takes the form y = x with no intrinsic 
scatter, and these yi are also perturbed by homoscedastic Gaussian noise with a 2 = 1 giving 100 y° bs . The 
resulting {x° bs ,y° bs } pairs are used as the data for the errors-in- variable Bayesian model, OLS, weighted 
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(a) (b) (c) 

Figure 6: Three simulated data sets each of 25 points, showing true trends from which the data are generated 
in red, the trend recovered by BCES in blue, and the mean Bayesian estimate in green. In 1000 simulations, 
BCES results performed worse than those shown in the central and right panels about 10% of the times. The 
simulations are meant to mimic the Lx — cr scaling of galaxy clusters (from |[T8l ). 




Estimated slope quoted slope error 



Figure 7: Comparison between BCES (blue) and the Bayesian approach (red) for a linear regression problem 
(from lfl"8l ) using 1000 simulations of a sample of 25 objects. Left panel: Histogram of point estimates of 
the slope parameter, note that the extreme bins represent < — 10 and > 30. In approximately 12% of the 
simulations, BCES returns significantly wrong estimates, the fit associated with one of these is plotted in the 
right hand panel of Figure [6l Right panel: Histogram of estimated variability of slope estimates; for BCES 
this is the returned standard error, for the Bayesian approach it is the returned standard deviation of the 
posterior mean for the slope. Note that estimates larger than 50 are binned for display purposes at 50. The 
two arrows denote the standard deviations calculated from the 1000 simulations of the observed differences 
between the true and the estimated slopes. 
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least squares, BCES and MLE (in all cases the variances of the noise are assumed known). The likelihood 
for the unknowns a, b, {xi} based on the observations {xf s , y° bs } can be written 

L(a, b, { Xi }) = — == exp( — 2 ) exp( — ~ 2 ) (22) 

Noting that for MLE for this experiment the likelihood has 102 unknown parameters. We follow BD1 in 
using the likelihood 

t( h\ ^Kl ( Eli(y° bs -ax° bs -b)\ 

L(a,b) = =exp( j—^ ;r L ^- — ) (23) 

which corresponds to integrating out the {x^. Notice the correspondence between the exponential term in 
this equation and Equation ©, although they differ in that Equation (l23l also has an a dependence in the 
leading non-exponential term. Therefore, when there are errors on the predictor quantity, weighted least 
squares is not the MLE estimate. 

For the Bayesian errors-in- variable model we retain the {x{\ in the model and so will need to assume a 
prior population model for them. Retaining the {x^ in the model does increase the computational cost al- 
though as a byproduct it does also generate estimates of the {x{\ which might be useful in some applications. 
In some astronomical problems, the x data structure is fairly well known (e.g. from previous experiments or 
from theory). If this were the case here, the prior would be 

Xi ~ti (0.1,l.l 2 ), i = 1, .... 100 (24) 

since this is actually the true population model. Of course it may well be the case that we do not have such 
a high level of prior knowledge and so we also consider an alternative prior 

Xi~ N(0, l 2 ), i = 1,...,100 (25) 

which has a difference in location and scale as well as a lighter tail behaviour as an example of possible 
mismatch between the true data structure and what we know about it. This prior is still relatively informative; 
it might perhaps be arrived at in a not strictly Bayesian sense by looking at a histogram of the 100 x° bs 
values. Finally, we also consider a less informative prior, with parameters to be determined at the same time 
as the other regression parameters; we have adopted a Normal prior with parameters treated as additional 
unknowns 

Xi ~ N(fl prior , CFprior), i = 1, . . . , 100 (26) 

with weak hyperpriors on the parameters 

Vprior ~ iV(0,10 4 ) (27) 
l/a 2 prior ~ £7(0,10). (28) 

Using the observed data, the Bayesian model makes predictions of y, y, for new x obs using Equation ((2TI) . 
The non-Bayesian techniques have point estimates of the slope a and intercept b using the original 100 
points, and these are used to give predictions y = ax obs + b for new x obs . To assess performance, 10000 
new x° bs values are simulated from the model and for each we calculate the residual between the predicted 
yi and the actual j/j. Figure [8] plots these residuals against the value of x obs for the various different fitting 
algorithms (binning the results in small x obs intervals to aid visualisation). Notice that we plot the residuals 
against x obs rather than against x; although this induces a correlation with the residuals for some methods, 
it is actually the performance for an observed x obs rather than an unobserved x which may be of interest 
to practitioners. This also emphasises the point made by [29], Chapter 2, that homoscedastic measurement 
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Figure 8: Performance of the various approaches in predicting the true value y given x . Each point show 
the average residual per small bin of x obs . The error bars indicate the standard error of the mean residuals. 



error is less of a problem for prediction than it is for parameter estimation because one is predicting on the 
basis of x obs and not x and thus methods which ignore the distinction are not penalised to the same extent. 
In this simulation, that makes OLS a competitive standard. 
Running through the results as laid out in Figure [8} 

Top row, left Bayes regression with a iio(0.1, l.l 2 ) prior performs very well with small residuals across 
the entire range of x obs as is to be expected given that the correct model is being fitted. The point 
estimates of the parameters themselves are 1.09 and 0.00 for the slope and intercept respectively. 

Top row, right Suppose that one naively attempts to predict y by y = x obs , in effect making use of the 
knowledge that y = x but ignoring the noise structure. The residual in this case is the perturbation 
x — x obs , i.e. a N(0, 1) value in this example, however it is as a result negatively correlated with x obs 
so that the largest residuals occur at the extremes of x° bs 

Second row, left OLS generates better predictions than the naive estimate. It achieves this better perfor- 
mance by underestimating the true slope, returning an estimated slope of 0.65 in place of the input 
1 and an intercept of 0.02 in place of 0, in agreement with j9} and Section [2] Feeling that this fitted 
line is not steep enough to capture the data trend (see discussion on Figure [Qfor a related case), some 
astronomical papers average this slope with the one derived by swapping x obs with y obs . However 
while this may be more appropriate for parameter estimation, it is suboptimal for prediction. In this 
case this gives a new fitted line with a slope near to one, as in our y = x obs case, and consequently 
the performance of this averaged OLS method is close to the naive estimator case. 

Second row, right Predicted y values computed using the BCES method are similar to those using the naive 
y = x obs method. Comparing Equation (fl"6l ) when a 2 xy { = with Equation © it is apparent that in 
improving the estimates of the slope and intercept, the prediction performance is degraded. 
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Table 1 : Mean residuals at various values of x 1 





obs r obs 

x = —5 x 


= —3 


x obs = — 1 


x ob3 = 1 


x obs _ g 


x obs = 5 


Bayes tio(0.1, l.V) prior 

x oba 

OLS 

BCES 

MLE 

X 2 /FitEXY 

Bayes N(/j,, a 2 ) prior 

Bayes N(0, l 2 ) prior 


0.98 
2.16 
0.44 
2.43 
0.65 
2.43 
0.45 
0.30 


0.06 
1.10 
0.07 
1.29 
0.20 
1.29 
0.07 
-0.01 


0.04 
0.47 
0.11 
0.58 
0.17 
0.58 
0.11 
0.09 


-0.07 
-0.40 
-0.09 
-0.37 
-0.12 
-0.37 
-0.09 
-0.06 


-0.17 
-1.14 
-0.15 
-1.19 
-0.26 
-1.19 
-0.16 
-0.08 


-0.15 
-1.36 

0.30 
-1.49 

0.11 
-1.49 

0.29 

0.42 



Any uncertainty associated with the predictions for a fixed value of x° bs is associated purely with the uncertainty of x corresponding 
to this x (and thus the corresponding y). This is not an issue for between-method comparison and so is omitted. 

Third row, left MLE using Equation (l23l) returns reasonable predictions. 

Third row, right x 2 /FitEXY, Equation (O performs less well than MLE because Equation (l23l) has a lead- 
ing term missing in Equation (©. 

Bottom row, left and right Bayesian errors-in-variables regression with either a N(fi, a) prior for the {xj} 
or a iV(0, 1) prior performs almost as well as the Bayesian model using the true distribution as a prior, 
showing that an accurate knowledge of the prior is not necessary to achieve good performance in this 
example. 

In addition to Figure [U Table Q] lists residuals at a few points (|x obs | = 1, 3, 5) for all our comparison 
performances. To summarise for this example the methods based on sound statistical models applied ap- 
propriately (MLE, Bayesian regression) outperform BCES and weighted least squares. OLS also does well 
here paradoxically because of its sheer simplicity. Of these methods, the Bayesian regression also generates 
estimates for the {xj}, if these are required, as well as reliable estimates of the regression parameters (un- 
like OLS). If we were talking about predicting observationally expensive masses (y) from observationally 
parsimonious mass proxies (x obs ), such as richness or X-ray luminosity, non-Bayesian fitting models are 
either inaccurate in predicting the trend between mass and proxy (i.e. return an incorrect slope of the y vs 
x relation) or perform very badly in predicting masses from observed values of mass proxies. The Bayesian 
fit estimates the slope accurately and provides well predicted masses, and so may achieve better results for 
the same costly telescope time, albeit at greater computational cost. 

6 Including more features of astronomical data in Bayesian regression 

Section [3] listed many common features of astronomical data. In this section we consider how we might 
expand the Bayesian errors-in-variable model to include some of these features and others. In some cases, 
it is a question of changing the likelihood of how the data arise. In others, it is the prior which must change. 
Notice that in the former case, a non-Bayesian approach might also work with this same modified likelihood, 
usually at the cost of a more complex fitting by Maximum Likelihood. 

• Heteroscedastic errors on both x and y are already accounted for in the errors-in-variable model. 
However so far they have been assumed known. Very often, uncertainties are not perfectly known 
because of the complexity of determining them (based on properties of the mechanism by which 
the data are obtained, rather than by a statistical repeat sampling approach). Q extends the simple 
example model to allow for this uncertainty. Denoting the value for the variance suggested by the 
physics by j, a prior is constructed which reflects both the positivity of oi i and the accuracy of 



19 



°v,i ~ S ~fxl (29) 

The degrees of freedom of the distribution, v, control the spread of the distribution, with large v 
meaning that quoted variances will be close to the true variances. |3j uses v = 6 to quantify with 95% 
confidence that quoted standard deviations are correct up to a factor of 2 (i.e. i < < 2). Notice 
that it is not possible to put a flat prior on these terms since there is a lack of identifi ability associated 
with partitioning the variability of the y ° bs between <jy i and a1 cat . 

Intrinsic scatter. This is taken to be Normal in the standard errors-in-variable model, but it could 
readily be replaced with something more suitable for the particular application, for example with a 
Cauchy distribution, y, L ~ Cauchy(axi + b,a^ cat ) if the scatter between x and y is thought to be 
exceptionally heavy tailed (the Cauchy distribution is equivalent to the t on 1 degree of freedom). 
An application is provided by the computation of the color offset of stellar locii ( ||471 ): allowing 
heavier-than-Gaussian tails one may account for contamination by QSO and objects with corrupted 
photometry. 

Non-ignorable data collection and selection effects. In this case, the problem is the same for Bayesian 
and non-Bayesian methods since it is the likelihood of the observed data which is affected. R8l 
nicely describes a modified likelihood taking into account selection effects. Suppose we introduce a 
new binary variable Di for observation i where Di = 1 if i is observed and zero otherwise and Di 
depends only on the value of the underlying The likelihood of observing a value y° bs , denoting 
other parameters of the model collectively by 9, can then be expressed 

f(ll obs ln „ m /(A = %)/(yf%,6>) 

J-oo/CA = Mz)f{z\y u e)dz 

where f(y° bs \yi, 0) is the usual likelihood when there are no selection effects (for example, N(yi, J 
in the usual errors-in- variable model) and f(D{ = l\yi) is the sample selection function (strictly 
related to the sky coverage). The integral in the denominator of Equation d30l gives the probability 
of detection given the values yi and 0. Working with this likelihood complicates both Bayesian and 
non-Bayesian methods. 

An application is provided by the Lx — T relation of Figure[3j where the simulated data are generated 
from the blue line, then selected according to the XMM-LSS selection function lfT2ll . In this example, 
the probability that an object enters the sample, f(Di = changes smoothly from zero to one 

depending on luminosity and temperature as it does with real data; it does not go abruptly from zero 
to one at a given threshold as assumed in survival analysis. Using the modified likelihood, the input 
regression can be estimated (see iTTTTl for details). 

Data structure and non-uniform population. This problem is readily addressed in the Bayesian setting 
by choosing the most appropriate prior for the population. For example when dealing with the source 
of the Malmquist bias in faint object measurements, it might be appropriate to use a common power 
law, as done in the case illustrated in Fig. l4l( lfT8l . see also B9l ). 

Non-Gaussian data and upper limits. Non-Gaussian data requires a change of likelihood model for 
x obs , y obs , or perhaps both. As an example, consider the application presented in 0, illustrated in 
Figure |2ta). Here the observations on the x axis are the logs of count data where it might be expected 
that count data are Poisson distributed: If we denote the counts by {nf 5 }, then a slight simplification 
of the model used in 0, would be 

n° bs ~ V(xi) Poisson error on observation with rate Xi 



20 




Figure 9: Hubble diagram for supernovae. The object flux weakens (the distance modulus, m — M, becomes 
greater) in a way that depends on the geometry of the Universe (from EH ) and leads to the recent Nobel 
prize winning discovery of the Dark Energy. With kind permission from Springer. 



Xi ~ U (0, 10 4 ) A weak prior on the underlying positive rate 

Ui ~ N(a \og(xi) + b, (r1 cat ) Gaussian regression prior on the response yi. (31) 

We emphasize that n° hs is a discrete count while X{ is the underlying continuous rate. 

Upper limits are automatically accounted for by the above change. Suppose, for example, we are 
dealing with photons and we detected zero photons, n° hs = 0, from a source i of intensity Xj. This 
value gives an upper limit to the source intensity (2.3 at 90% confidence). 

• Circularity arguments. Very often in order to take a measurement aimed at constraining a parameter, 
we need to known the parameter before performing the measurement, leading to a circular reasoning. 
For example, if one wants to infer cosmological parameters from clusters of galaxies, we need scal- 
ing relations (to relate the observable quantities to mass) and measurements performed in standard 
apertures (say rsoo)- However, the angular size of the latter depends on the cosmological parameters 
one is aiming to determine. If instead supernovae are used for the same goal (and cosmological con- 
straints come from the scaling between their distance modulus and redshift, see Figure [9}, then one 
needs to know how their absolute luminosity should be corrected (peak luminosity is brighter for slow 
fading supernovae, ll50l ). a correction which in turn requires knowledge of cosmological parameters, 
which is exactly what one wishes to infer. The Bayesian solution is to model the dependence of the 
measurements on the cosmological parameters and fit them all at once, see 11511 . |[52l and ll53l . 

Other complexities, such as non-linear regressions, extra-Poisson fluctuations and contamination have 
been already successfully dealt with in astronomical works using Bayesian methods (01. lfT3l . 11221 . 11211 ) 

We note that on the computational side, accounting for these and other complications is relatively pain 
free in terms of statistical effort (if not in terms of computing time) since converting the written symbolic 
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equations into code and then implementing it is undertaken by Bayesian tools such as JAGS. Readers may 
refer to the Appendix and 0, Q31, El, 113, El, and (23 for examples. 

7 Conclusions 

In the world of astrophysics, regression is both important and complicated by the nature of the types of data 
encountered. All approaches to regression can be seen as a way to compress the information contained in 
the data into just a few numbers such as the regression coefficients, predicted values or a measure of the 
support given by the data to a model. In this article we have attempted to review some of the modelling 
problems which arise and to summarise some of the techniques which have arisen to tackle them. We have 
also perhaps revealed a personal bias towards Bayesian approaches. 

Astronomy is facing an exponential increase in the number and complexity of analyses. We are rapidly 
exhausting analyses not requiring complex (regression) models. The ease with which the Bayesian approach 
addresses the awkward features of the astronomical data makes it more and more appealing. 
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A Appendix 

We stress that once the regression model is stated in mathematical terms such as in Equations ( fT8l ) and 
( fl9l ), Bayesian tools such as BUGS R3l or JAGS B4l convert these equations into code and perform the 
necessary stochastic computations. Distribution of predicted values are also a standard output of these 
tools. Equations listed in the text find an almost literal translation in JAGS: Poisson, Normal, Uniform and 
Student t distributions become dpois , dnorm, dunif, dt , respectively. For some distributions, it is 
necessary to express them as particular cases of other distributions, for example the x 2 is a particular form 
of the Gamma distribution. JAGS, following BUGS uses precisions, prec = 1 /a 2 , in place of variances a 2 . 
Furthermore, they use natural logarithms rather than decimal ones. The < — symbol reads "takes the value 
of, the ~ symbol reads "is distributed as". x y is coded as pow(x, y) in JAGS. 

A.l Code for the Bayesian error-in-variable regression, Section l47l 

model 
{ 

for (i in 1 : length ( obsx) ) 
{ 

x[i] ~ dunif (-1 . OE+4, 1 . OE+4) 
obsx[i] ~ dnorm (x [ i ], prec . x [ i ] ) 
y[i] " dnorm (b+a*x [ i ] , prec. scat) 
obsy[i] ~ dnorm (y [ i ], prec . y [ i ] ) 

} 

prec. scat " dgamma ( 1 . OE-2 , 1 . OE-2 ) 
b ~ dnorm (0 . 0, 1 . OE-4) 
a " dt ( , 1 , 1 ) 

} 

A.2 Code for generating simulated data in Section 15^21 

model 
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{ 

x ~ dt (0.1,pow(l.l,-2) , 10) 
y<-1.0*x+0.0 
obsx ~ dnorm(x,l) 
obsy ~ dnorm(y,l) 

} 

A.3 Code for Bayesian regression model in Section [5721 

model 
{ 

alpha " dnorm(0 . 0, 1 . OE-4) 

beta ~ dt (0,1,1) 

for (i in 1 : length ( obsx) ) 

{ 

obsx[i] " dnorm (x [ i ] , prec . obsx [ i ] ) 
obsyfi] " dnorm (y [ i ], prec . obsy [ i ] ) 
y[i] <- alpha+beta*x [ i ] 

# t prior for the x population OR 
x[i] ~ dt (0.1, pow( 1.1,-2 ), 10) 

# N(0,1) prior for the x population OR 
x [ i ] ~ dnorm (0,1) 

# Normal prior for the x population with hyperparameters 
x[i] ~ dnorm (mu, prec) 

} 

mu ~ dnorm (0, 1 . OE-4) 
prec ~ dunif (0 . 0, 10 . 0) 

} 

In order to predict y values, it suffices to list the x obs values for which y is needed, entering a "NA" = "not 
available" code for the corresponding y obs value to indicate to the program that they should be estimated. 

A.4 Code for Bayesian regression model Equation d3TT) illustrated in Figure |2](a) 

data 
{ 

nu <-6 

} 

model 
{ 

for (i in 1 : length ( obstot ) ) 
{ 

obsbkg[i] " dpois (nbkg [ i ] ) 

obstot [i] ~ dpois (nbkg [i] /C [i] +n200 [i] ) 

n200[i] _ dunif (0, 3000) 

nbkgfi] ~ dunif ( , 300 ) 

precy[i] ~ dgamma ( 1 . OE-5 , 1 . OE-5 ) 

obslgM200[i] ~ dnorm ( lgM2 [ i ], precy [ i ] ) 

obsvarlgM2 [ i ] ~ dgamma (0 . 5*nu, . 5*nu*precy [i] ) 

z[i] <- alpha+14 . 5+beta* (log (n200 [i] ) 12 . 30258-1 . 5) 

} 

intrscat <- 1/sqrt (prec . intrscat) 
prec . intrscat ~ dgamma ( 1 . OE-5 , 1 . OE-5 ) 
alpha " dnorm(0 . 0, 1 . OE-4) 
beta ~ dt (0,1,1) 

) 
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